Abstract

Neuroevolution is an active and growing research field, especially in times of
increasingly parallel computing architectures. Learning methods for Artificial
Neural Networks (ANNs) can be divided into two groups. Neuroevolution is mainly
based on Monte-Carlo techniques and belongs to the group of global search
methods, whereas other methods such as backpropagation belong to the group of
local search methods. ANNs exhibit important symmetry properties, which can
influence Monte-Carlo methods. Local search methods, on the other hand, are
generally unaffected by these symmetries. In the literature, dealing with the
symmetries is generally reported as being ineffective or even yielding inferior
results. In this paper, we introduce the so-called Minimum Global Optimum
Distance principle, derived from theoretical considerations, for effective
symmetry breaking, applied to offline supervised learning. Using Differential
Evolution (DE), which is a popular and robust evolutionary global optimization
method, we experimentally show significant global search efficiency
improvements by symmetry breaking.



1 Introduction 

Artificial Neural Networks (ANNs) are general function approximators [11] and
can be used to find a functional representation of a data set. Another point
of view is that ANNs represent a way of data compression [2]. The compression
ratio depends on the number of neurons used in the ANN which encodes the data:
the fewer neurons at the same representation quality, the better the
compression.

The estimation of the ANN parameters is generally a computationally demanding
task [23]. The corresponding Maximum-Likelihood derived cost function comprises
many local optima. Therefore, local search techniques generally fail to find an
optimal solution and only a suboptimal solution is found, which is a local
optimum [11]. In addition, local search techniques are mainly sequential
methods and parallel implementations are limited. On the other hand, global
optimization techniques based on Monte-Carlo methods such as the Genetic
Algorithm (GA) [6, 19], Covariance Matrix Adaptation Evolution Strategies
(CMA-ES) [9][10] or Differential Evolution (DE) [20][24][28] are generally very
well parallelizable. Differential Evolution is one of the most popular and
robust Monte-Carlo global search methods, which outperforms many other
evolutionary algorithms on a wide range of problems [3, 27, 29]. DE is
successfully used in many engineering problems such as multiprocessor synthesis
[13], optimization of radio network designs [21], training multi layer neural
networks [18], training RBF networks and many others.

Due to inherent symmetries in the parametric representation of ANNs, there
are also multiple global optima. The multiple global optima result from point
symmetries and permutation symmetries [25][26]. The effect of these symmetries
on Genetic Algorithms is reported to be minimal and negligible [8]. However,
there are no reports on the impact of the ANN symmetries on the DE method. In
this paper, we show that DE is very sensitive to multiple global optima. We
derive a symmetry breaking operator based on theoretical considerations, which
is optimal according to a Minimum Global Optimum Distance condition. In
experimental studies on offline supervised learning problems, symmetry breaking
improves the global convergence speed by up to two orders of magnitude.
Comparisons to CMA-ES, which is a state-of-the-art evolutionary method for ANN
learning [7][22], show that CMA-ES is outperformed on complex learning problems
using smaller networks, which represent better compression.



2 Brief Review of Artificial Feedforward Neural 
Networks 

We deal with Artificial (Feedforward) Neural Networks (ANNs) for the
approximation of functions $f : [-1, 1]^d \to [-1, 1]^q$, having $L$ layers (one
input layer, $L - 2$ hidden layers and one output layer) and $N_l$ sigmoid-type
neurons per hidden layer $l$. For each neuron $(l, n)$, we denote a parameter
vector by

$$\eta_n^l = (w_n^l, \tau_n^l), \tag{1}$$

where $w_n^l$ is the weight vector and $\tau_n^l$ is the shift scalar. The
output of a tanh-type sigmoid neuron $(l, n)$ is calculated by

$$x_n^l = \tanh\left((w_n^l)^T x^{l-1} + \tau_n^l\right), \tag{2}$$

where $x^l = (x_1^l, \ldots, x_{N_l}^l)$ is the output vector of layer $l$. After
all hidden layers $l = 2, 3, \ldots, L - 1$ are evaluated, the final output
component $y_n$ of the output vector $y$ is calculated by

$$y_n = (w_n^L)^T x^{L-1}, \quad n = 1, \ldots, q. \tag{3}$$

We denote the parameter vector of all neurons in a layer $l$ by $\lambda^l$,
where

$$\lambda^l = (\eta_1^l, \ldots, \eta_{N_l}^l). \tag{4}$$

The total parameter vector of the whole network is given by

$$\theta_a = (\lambda^2, \ldots, \lambda^{L-1}, w_1^L, \ldots, w_q^L), \tag{5}$$

where $w^L = (w_1^L, \ldots, w_q^L)$ is the vector of the output layer weights.
The function defined by the network is denoted by

$$y = \Omega(\theta_a; x), \tag{6}$$

where $x$ is the input vector, which equals the output of the input layer, so
that $x^1 = x$.
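To make the notation concrete, the forward evaluation of Eqs. (2) and (3) can be sketched in Python as follows. This is only an illustrative sketch: the function name `forward` and the nested-list layout of the parameters are assumptions made here and are not part of the paper.

```python
import math

def forward(hidden_layers, output_weights, x):
    """Evaluate Eqs. (2)-(3): tanh hidden layers followed by a linear output layer.

    hidden_layers: list over hidden layers; each layer is a list of (w, tau)
                   pairs, one per neuron, where w is the list of input weights.
    output_weights: list of q weight vectors, one per output component.
    """
    activation = list(x)  # x^1 = x, the output of the input layer
    for layer in hidden_layers:
        activation = [
            math.tanh(sum(wi * ai for wi, ai in zip(w, activation)) + tau)
            for (w, tau) in layer
        ]
    # Linear output layer without a shift term, as in Eq. (3)
    return [sum(wi * ai for wi, ai in zip(w, activation)) for w in output_weights]

# A 1-2-1 network: one input, two hidden neurons, one output
hidden = [[([1.0], 0.0), ([-0.5], 0.2)]]
out_w = [[0.3, -0.7]]
y = forward(hidden, out_w, [0.4])
```

In this layout each `(w, tau)` pair plays the role of one neuron parameter vector $\eta_n^l$, and one inner list of pairs corresponds to $\lambda^l$.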

Assuming additive normal i.i.d. noise on the available data $(x_k, y_k)$,
$k = 1, \ldots, K$, the ML-estimate $\hat{\theta}_a$ of the parameters
$\theta_a$ is determined by the least squares solution:

$$\hat{\theta}_a = \arg\min_{\theta_a} \sum_{k=1}^{K} (y_k - \Omega(\theta_a; x_k))^T (y_k - \Omega(\theta_a; x_k)). \tag{7}$$

Since the output layer is linear as shown in Eqn. (3), the corresponding weights
$w_n^L$ can be determined by a least squares method, as described in [17], which
we adopt in this paper. This has the advantage that global search is applied only
to the non-linear part of the parameter space, which generally speeds up
convergence. Consequently, the parameter vector for global optimization $\theta$
consists of

$$\theta = (\lambda^2, \ldots, \lambda^{L-1}). \tag{8}$$

In the following section, we briefly review the DE method. 
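The split between non-linear hidden parameters and linear output weights can be illustrated with a small sketch. It assumes a single hidden layer and a scalar output, and it solves the normal equations by Gaussian elimination; the method of [17] may differ, and all function names below are hypothetical.

```python
import math

def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

def hidden_output(layer, x):
    """Output vector of one tanh hidden layer, Eq. (2)."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + tau) for w, tau in layer]

def fit_output_weights(layer, xs, ys):
    """Least-squares output weights for fixed hidden parameters (scalar output)."""
    h = [hidden_output(layer, x) for x in xs]  # K x N matrix of hidden outputs
    n = len(layer)
    gram = [[sum(row[i] * row[j] for row in h) for j in range(n)] for i in range(n)]
    rhs = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(n)]
    return solve_linear(gram, rhs)

# Data generated by known output weights are recovered for fixed hidden parameters
layer = [([1.0], 0.0), ([2.0], 0.5)]
true_w = [0.3, -0.7]
xs = [[-1.0], [-0.5], [0.0], [0.5], [1.0]]
ys = [sum(w * h for w, h in zip(true_w, hidden_output(layer, x))) for x in xs]
fitted_w = fit_output_weights(layer, xs, ys)
```

With such a routine, the global optimizer only has to propose the hidden-layer parameters of Eq. (8); the output weights follow from the data.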



3 Brief Review of Differential Evolution 

DE is one of the best general purpose evolutionary global optimization methods
available. It has only linear complexity and it is known as an efficient global
optimization method for continuous problem spaces. The optimization is based
on a population of $N_p$ solution candidates $\theta_i$,
$i \in \{1, \ldots, N_p\}$, where each candidate has a position in the
$D$-dimensional search space. Initially, the solution candidates are generated
randomly according to a uniform distribution within the provided intervals of
the search space. The population improves by generating new positions
iteratively for each candidate. For each individual $\theta_{i,G}$ of generation
$G$, a new trial position $u_{i,G}$ is determined by

$$v_{i,G} = \theta_{r_1,G} + F \cdot (\theta_{r_2,G} - \theta_{r_3,G}) \tag{9}$$
$$u_{i,G} = C(\theta_{i,G}, v_{i,G}), \tag{10}$$

where $r_1, r_2, r_3$ are pairwise different randomly chosen integers from the
discrete set $\{1, \ldots, N_p\}$ and $F$ is a weighting scalar. The vector
$v_{i,G}$ is used together with $\theta_{i,G}$ in the crossover operation,
denoted by $C()$. The crossover operator copies coordinates from both
$\theta_{i,G}$ and $v_{i,G}$ in order to create the trial vector $u_{i,G}$.
$C$ is provided with the probability $C_r$ to copy coordinates from
$\theta_{i,G}$, whereby coordinates from $v_{i,G}$ are copied with a probability
of $1 - C_r$ to $u_{i,G}$. The new candidate $u_{i,G}$ replaces $\theta_{i,G}$
only if it has a lower cost; otherwise it is discarded.
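One generation of this scheme can be sketched as follows. The sketch is not the implementation used in the experiments: the function name `de_generation` and the synchronous update of the whole population are assumptions. The crossover line follows the wording of the text (a coordinate of $\theta_{i,G}$ is kept with probability $C_r$); DE implementations often define $C_r$ the other way round, as the probability of taking the mutant coordinate, so this choice should be treated as an assumption as well.

```python
import random

def de_generation(population, cost, f=0.5, cr=0.9, rng=random):
    """One DE generation following Eqs. (9)-(10) with greedy replacement."""
    n_p = len(population)
    new_population = []
    for theta in population:
        r1, r2, r3 = rng.sample(range(n_p), 3)  # pairwise different indices
        # Eq. (9): base vector plus scaled difference vector
        v = [a + f * (b - c) for a, b, c in
             zip(population[r1], population[r2], population[r3])]
        # Eq. (10), as worded in the text: keep theta's coordinate with probability cr
        u = [t if rng.random() < cr else vi for t, vi in zip(theta, v)]
        # The trial vector replaces theta only if it has a lower cost
        new_population.append(u if cost(u) < cost(theta) else theta)
    return new_population

# Toy usage on a sphere function with a fixed seed
rng = random.Random(0)
sphere = lambda t: sum(v * v for v in t)
pop = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(10)]
initial_best = min(sphere(p) for p in pop)
for _ in range(50):
    pop = de_generation(pop, sphere, rng=rng)
final_best = min(sphere(p) for p in pop)
```

Because of the greedy replacement, the best cost in the population can never increase from one generation to the next.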



DE includes an adaptive range scaling for the generation of solution candidates
through the difference term in Equation (9). This leads to a global search with
large step sizes in the case where the solution candidate vectors are widely
spread within the search space due to a relatively large mean difference
vector. In the case of a converging population, the mean difference vector
becomes relatively small and this enables efficient fine tuning at the final
phase of the optimization process. The crossover operator has a complicated role
in the dynamics of the population. For example, it produces rotations that are
very important when dealing with separable variables. In some cases, it can help
to increase the diversity of the population or it can also speed up the
convergence, depending on the problem.



4 Symmetries in ANNs

A symmetry is an operator $\Phi$ which applies to the parameter vector $\theta$
of the ANN and leaves the output of the ANN invariant:

$$\Omega(\theta; x) = \Omega(\Phi(\theta); x) \quad \forall \theta, x. \tag{11}$$

Non-reducible ANNs comprise two types of symmetries. The first type is a
point symmetry on the neuron parameter level, since

$$w \tanh(x) = -w \tanh(-x) \quad \forall w, x. \tag{12}$$

It follows that a point symmetry operator $O_n^l$ defined by

$$O_n^l : \; \eta_n^l \to -\eta_n^l, \quad w_{m,n}^{l+1} \to -w_{m,n}^{l+1} \quad \forall m, \tag{13}$$

applied to the parameters of neuron $(l, n)$ and the $n$-th weight component
$w_{m,n}^{l+1}$ of all neurons $m$ in the following layer $l + 1$ does not
change the output of the ANN. In Fig. 1 an example for the application of
$O_1^2$ is given. For each layer, the point symmetry yields $2^{N_l}$ symmetric
equivalents of the parameter vector $\theta$.

The second type of symmetry is a permutation symmetry of the neuron parameters
$\eta_n^l$ and the corresponding weight parameters in the next layer. A
permutation operator $P_{j,k}^l$ defined by

$$P_{j,k}^l : \; \eta_j^l \leftrightarrow \eta_k^l, \quad w_{m,j}^{l+1} \leftrightarrow w_{m,k}^{l+1} \quad \forall m, \tag{14}$$

leaves the output invariant. Note that $P_{j,k}^l = P_{k,j}^l$. In Fig. 2 the
application of $P_{1,2}^2 = P_{2,1}^2$ is illustrated. In each layer, there are
$N_l!$ symmetric equivalents of the parameter vector $\theta$ due to the
permutation symmetries. The total count of symmetric equivalents per layer $l$
is $2^{N_l} N_l!$.

Figure 1: Application of the point symmetry operator $O_1^2$, which changes the
signs of the $\eta_1^2$-parameters in layer two and the
$w_{\cdot,1}^3$-parameters in layer three, respectively.

Another important property is that the length of the vector $\theta$ is
invariant under such symmetry operators:

$$\|\Phi(\theta)\| = \|\theta\| \quad \forall \Phi. \tag{15}$$

As a result, all symmetric equivalents of a global optimum lie on a hypersphere.
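Both symmetry types and the length invariance can be checked numerically on a small network. The following sketch assumes a single hidden layer; the helper names and the nested-list parameter layout are hypothetical, and the norm is taken over all parameters including the output weights.

```python
import math

def forward(hidden, out_w, x):
    """Single tanh hidden layer followed by a linear output layer."""
    h = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + tau) for w, tau in hidden]
    return [sum(wi * hi for wi, hi in zip(w, h)) for w in out_w]

def point_symmetry(hidden, out_w, n):
    """Flip the sign of neuron n's parameters and of its outgoing weights."""
    hidden = [([-wi for wi in w], -tau) if i == n else (w, tau)
              for i, (w, tau) in enumerate(hidden)]
    out_w = [[-wi if i == n else wi for i, wi in enumerate(w)] for w in out_w]
    return hidden, out_w

def permutation_symmetry(hidden, out_w, j, k):
    """Exchange neurons j and k together with their outgoing weights."""
    hidden = list(hidden)
    hidden[j], hidden[k] = hidden[k], hidden[j]
    out_w = [list(w) for w in out_w]
    for w in out_w:
        w[j], w[k] = w[k], w[j]
    return hidden, out_w

def norm(hidden, out_w):
    """Euclidean length of the flattened parameter vector."""
    flat = [v for w, tau in hidden for v in (*w, tau)] + [v for w in out_w for v in w]
    return math.sqrt(sum(v * v for v in flat))

hidden = [([0.5, -1.0], 0.1), ([1.5, 0.3], -0.4), ([-0.7, 0.9], 0.2)]
out_w = [[0.3, -0.7, 1.1], [0.2, 0.4, -0.6]]
x = [0.25, -0.8]
y_ref = forward(hidden, out_w, x)
y_point = forward(*point_symmetry(hidden, out_w, 1), x)
y_perm = forward(*permutation_symmetry(hidden, out_w, 0, 2), x)
```

The outputs `y_point` and `y_perm` agree with `y_ref` up to floating point rounding, and the parameter norm is unchanged by either operator.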
Since such symmetries multiply the local and global optima count in the
parameter space, the ultimate goal of symmetry breaking is to reduce the total
number of local optima in the parameter space by avoiding all but one of the
symmetrically equivalent space partitions. There are infinitely many ways of
breaking the symmetry with the operators $O_n^l$ and $P_{j,k}^l$. The
differences arise from the condition on which these operators are applied.
Fig. 3 shows different ways of breaking a point symmetry in relation to the
global optimum. It can be seen that the symmetry invariant region which has
maximum distances to the global optima enables the optimal separation or
partitioning. This way, an optimal isolation between all symmetric equivalents
of the global optima is achieved. As a result, the influence of other
neighbouring global optima is decreased to a minimum, which maximizes the
attraction of the global optimum of the selected partition. It can be easily
shown that such a symmetry invariant region is the result of the following
separation condition

$$\tilde{\theta} = \begin{cases} \theta & \text{for } \|\theta - \hat{\theta}\| < \|-\theta - \hat{\theta}\| \\ -\theta & \text{for } \|\theta - \hat{\theta}\| > \|-\theta - \hat{\theta}\|, \end{cases} \tag{16}$$

where $\hat{\theta}$ denotes the selected global optimum, $\theta$ the parameter
vector before symmetry breaking and $\tilde{\theta}$ the parameter vector after
symmetry breaking. Taking into account the additional permutation symmetry,
which is independent of the point symmetry, the operator $\Phi$ is generally
composed of a chain of point symmetry and permutation symmetry operators. As an
example, $\Phi = O_n^l \circ P_{j,k}^l$ applies a permutation symmetry followed
by a point symmetry operator.

Figure 2: Application of the permutation symmetry operator
$P_{1,2}^2 = P_{2,1}^2$, which exchanges the parameters
$\eta_1^2 \leftrightarrow \eta_2^2$ in layer two and the parameters
$w_{1,1}^3 \leftrightarrow w_{1,2}^3$, $w_{2,1}^3 \leftrightarrow w_{2,2}^3$ in
layer three.

We denote the set of all possible symmetry operators by $S$. We generalize the
separation condition (16) and define the following ideal separation, or, in
other words, symmetry breaking:

$$\hat{\Phi} = \arg\min_{\Phi \in S} \|\Phi(\theta) - \hat{\theta}\|, \quad \tilde{\theta} = \hat{\Phi}(\theta). \tag{17}$$

This means that the ideal separation selects the symmetry operator $\hat{\Phi}$
from the set of all possible symmetry operators which minimizes the distance of
the parameter vector to a global optimum $\hat{\theta}$.
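For a very small layer, the ideal separation of Eq. (17) can be evaluated by brute force, which also makes the $2^{N_l} N_l!$ growth of the search visible. The sketch below is an illustration under simplifying assumptions: it treats a single layer whose neurons are given as independent parameter blocks, and the names `equivalents`, `distance` and `ideal_separation` are hypothetical.

```python
import itertools
import math

def equivalents(blocks):
    """Yield all 2^N * N! sign/permutation equivalents of a list of neuron blocks."""
    n = len(blocks)
    for perm in itertools.permutations(range(n)):
        for signs in itertools.product((1.0, -1.0), repeat=n):
            yield [[s * v for v in blocks[p]] for p, s in zip(perm, signs)]

def distance(a, b):
    """Euclidean distance between two block lists of equal shape."""
    return math.sqrt(sum((x - y) ** 2 for ba, bb in zip(a, b) for x, y in zip(ba, bb)))

def ideal_separation(blocks, reference):
    """Brute-force version of Eq. (17): pick the equivalent closest to the reference."""
    return min(equivalents(blocks), key=lambda cand: distance(cand, reference))

# A scrambled symmetric equivalent is mapped back onto the reference optimum
reference = [[1.0, 2.0], [3.0, 4.0], [-5.0, 6.0]]
scrambled = [[5.0, -6.0], [-1.0, -2.0], [3.0, 4.0]]
recovered = ideal_separation(scrambled, reference)
n_equivalents = sum(1 for _ in equivalents(scrambled))
```

With three neurons there are already $2^3 \cdot 3! = 48$ candidates to compare, and the count grows faster than exponentially with the layer width.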

4.1 Approximation of the ideal separation 

There are two problems for the practical application of the ideal separation.
First, the global optimum is not known a priori. However, iterative algorithms
like DE produce intermediate results at each iteration, which can be regarded
as an approximation of the global optimum. This approximation becomes better
with increasing iteration number. Therefore, we propose to choose the best
individual of the DE population at each iteration as an approximation for the
global optimum. Second, the brute force method for finding an optimal solution
to (17) has exponential complexity. Instead of trying to find the optimal
symmetry breaking operator, we propose to apply only one single symmetry
operator $O_n^l$ or $P_{j,k}^l$ at a time. The symmetry operator, layer and
neuron are randomly selected. The proposed heuristic only approximates
$\hat{\Phi}$, but this approximation improves with the iteration count. For each
neuron $(l, n)$, we define a symmetry relevant parameter block $\beta_n^l$ as

$$\beta_n^l = (\eta_n^l, w_{1,n}^{l+1}, \ldots, w_{N_{l+1},n}^{l+1}), \quad l = 2, \ldots, L - 2, \quad \text{and} \quad \beta_n^{L-1} = (\eta_n^{L-1}). \tag{18}$$

Given a parameter vector $\theta$ and an approximation of the global optimum
$\hat{\theta}$ with corresponding parameter blocks $\beta_n^l$ and
$\hat{\beta}_n^l$, the pseudocode in Algorithm 1 describes the proposed
heuristic. A property of this heuristic is that the symmetry is neither
completely nor uniquely broken, and the resulting modified parameters may belong
to different partitions over time. As $\hat{\theta}$ also changes over time, the
final convergence result will be a random global optimum among all possible
global optima.

Figure 3: Example of a point symmetry in 2-D, where
$f(\theta) = f(-\theta)$ $\forall \theta$. The plots show worst case (left),
suboptimal (middle) and optimal separation lines (right) for point symmetry
breaking. The separating line divides the parameter space into two parts, where
each partition contains a global optimum ($\hat{\theta}$ and $-\hat{\theta}$).

The proposed symmetry breaking method is always applied to each individual's
position $\theta_{i,G}$ (see Eqn. (10)) at each iteration prior to the
application of Eqn. (10).
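A single step of the heuristic of Algorithm 1 can be sketched as follows. The sketch stores every block $\beta_n^l$ as its own list, which is only consistent when the blocks do not share entries (for example a single hidden layer whose output weights are found by least squares); an implementation for deeper networks would have to operate on the shared entries of $\theta$. The function names and the zero-based indexing are assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two parameter blocks."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def break_symmetry_step(layers, best_layers, rng=random):
    """Apply at most one symmetry operator to `layers`, in place.

    layers[l][n] is the block beta_n^l of the candidate, best_layers[l][n]
    the corresponding block of the current best individual.
    """
    mu = rng.randrange(2)              # symmetry operator selection
    l = rng.randrange(len(layers))     # hidden layer selection
    blocks, best = layers[l], best_layers[l]
    n = rng.randrange(len(blocks))     # neuron selection
    if mu == 0:
        # Point symmetry: flip the block only if that reduces the distance
        d1 = sq_dist(blocks[n], best[n])
        d2 = sq_dist([-v for v in blocks[n]], best[n])
        if d1 > d2:
            blocks[n] = [-v for v in blocks[n]]
    else:
        # Permutation symmetry: swap two blocks only if that reduces the distance
        m = rng.randrange(len(blocks))
        d1 = sq_dist(blocks[n], best[n]) + sq_dist(blocks[m], best[m])
        d2 = sq_dist(blocks[n], best[m]) + sq_dist(blocks[m], best[n])
        if d1 > d2:
            blocks[n], blocks[m] = blocks[m], blocks[n]

def total_sq_dist(layers, best_layers):
    return sum(sq_dist(a, b) for la, lb in zip(layers, best_layers) for a, b in zip(la, lb))

# Repeated steps never increase the distance to the reference individual
rng = random.Random(0)
best = [[[1.0, 2.0], [3.0, 4.0], [-5.0, 6.0]]]
cand = [[[5.0, -6.0], [-1.0, -2.0], [3.0, 4.0]]]
history = [total_sq_dist(cand, best)]
for _ in range(200):
    break_symmetry_step(cand, best, rng)
    history.append(total_sq_dist(cand, best))
```

Each step costs only a few block-wise distance evaluations, in contrast to the exhaustive search implied by Eq. (17).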



5 Experiments 

In this section, we present results of experiments that demonstrate the
performance improvements achieved by symmetry breaking. The following methods
are compared in regression and classification tests: Differential Evolution
(DE), Covariance Matrix Adaptation Evolution Strategies (CMA-ES) and DE with
symmetry breaking (DE-SB). With a $D$-dimensional parameter space, all tests
are performed with the following settings:
• DE, DE-SB settings: $F = 0.5$, $C_r = 0.9$, initial population is randomly
generated in the $D$-dimensional hypercube $[-1, 1]^D$ (uniformly),

• CMA-ES settings: we used the suggested settings for enhanced global search
abilities mentioned in the C-code reference implementation.

Algorithm 1 Proposed heuristic for breaking symmetry. Parameters $\mu, l, n, m$
are sampled from corresponding discrete uniform distributions $U(.)$. A symmetry
operator $X$ is only applied to the parameter vector when it decreases the
distance to the global optimum $\hat{\theta}$, i.e.,
$\|X(\theta) - \hat{\theta}\| < \|\theta - \hat{\theta}\|$. Algorithm input:
$\theta$ and $\hat{\theta}$, effect: (possibly) modify parameter vector $\theta$.

    symmetry operator selection: sample $\mu \sim U(\{0, 1\})$
    hidden layer selection: sample $l \sim U(\{2, \ldots, L - 1\})$
    neuron selection: sample $n \sim U(\{1, \ldots, N_l\})$ in hidden layer $l$
    if $\mu = 0$ then
        [point symmetry operator selected]
        // would the point symmetry operator $O_n^l$ decrease the distance?
        // ($\|O_n^l(\theta) - \hat{\theta}\| < \|\theta - \hat{\theta}\|$)
        calculate distance-square for NOT applying the operator $O_n^l$:
            $D_1 = \|\beta_n^l - \hat{\beta}_n^l\|^2$
        calculate distance-square for applying the operator $O_n^l$:
            $D_2 = \|-\beta_n^l - \hat{\beta}_n^l\|^2$
        if $D_1 > D_2$ then
            apply point symmetry operator $O_n^l$: set $\beta_n^l = -\beta_n^l$
        end if
    else
        [permutation symmetry operator selected]
        second neuron selection: sample $m \sim U(\{1, \ldots, N_l\})$ in hidden layer $l$
        // would the permutation operator $P_{m,n}^l$ decrease the distance?
        // ($\|P_{m,n}^l(\theta) - \hat{\theta}\| < \|\theta - \hat{\theta}\|$)
        calculate distance-square for NOT applying the operator $P_{m,n}^l$:
            $D_1 = \|\beta_n^l - \hat{\beta}_n^l\|^2 + \|\beta_m^l - \hat{\beta}_m^l\|^2$
        calculate distance-square for applying the operator $P_{m,n}^l$:
            $D_2 = \|\beta_n^l - \hat{\beta}_m^l\|^2 + \|\beta_m^l - \hat{\beta}_n^l\|^2$
        if $D_1 > D_2$ then
            apply permutation symmetry operator $P_{m,n}^l$: swap $\beta_m^l \leftrightarrow \beta_n^l$
        end if
    end if

Given a parameter vector $\theta$ and a data set $(x_k, y_k)$, the ANN produces
a cost $e$ defined by the Mean Squared Error (MSE), which is derived from
Eqn. (7):

$$e = \frac{1}{K \cdot q} \sum_{k=1}^{K} (y_k - \Omega(\theta; x_k))^T (y_k - \Omega(\theta; x_k)). \tag{19}$$

In order to limit the $D$-dimensional parameter space to a feasible region,
we apply a penalty approach. Due to the length invariance under the symmetry
operators as shown in Eqn. (15), the feasible region is defined by a hypersphere
of radius $\sqrt{D}$. In case of $\|\theta\| > \sqrt{D}$, the cost function (19)
is evaluated at the rescaled parameter vector $\sqrt{D}\,\theta / \|\theta\|$
and a penalty term $50(\|\theta\| - \sqrt{D})$ is added to the cost $e$.
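The penalty approach can be sketched as a thin wrapper around the cost function. The sketch assumes that the rescaling projects the parameter vector back onto the hypersphere of radius $\sqrt{D}$ and that the penalty factor is 50; the function name `penalized_cost` and the toy cost used in the example are hypothetical.

```python
import math

def penalized_cost(theta, mse, penalty_factor=50.0):
    """Evaluate `mse` inside the hypersphere of radius sqrt(D), penalize outside."""
    radius = math.sqrt(len(theta))
    length = math.sqrt(sum(v * v for v in theta))
    if length <= radius:
        return mse(theta)
    # Outside the feasible region: evaluate at the rescaled vector, add the penalty
    rescaled = [v * radius / length for v in theta]
    return mse(rescaled) + penalty_factor * (length - radius)

# Toy cost standing in for the network MSE of Eq. (19)
toy_mse = lambda t: sum(v * v for v in t)
inside = penalized_cost([0.5, 0.5], toy_mse)
outside = penalized_cost([3.0, 4.0], toy_mse)
```

The penalty grows linearly with the distance from the hypersphere, so candidates far outside the feasible region are always worse than their rescaled counterparts.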
5.1 Regression problems

All utilized target functions are defined over $x \in [-1, 1]^d$ and map to
$[-1, 1]$. A data set, which consists of training and test samples, is generated
by sampling $x_i$ from a $d$-dimensional uniform distribution $U^d(-1, 1)$

$$x_i \sim U^d(-1, 1), \tag{20}$$

and adding normally distributed noise with zero mean and variance $\sigma^2$ to
the function values $y_i$

$$y_i = f(x_i) + n, \quad n \sim \mathcal{N}(0, \sigma^2), \quad \sigma = 5 \times 10^{-3}. \tag{21}$$
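The sampling of Eqs. (20) and (21) can be sketched as follows. The function names are hypothetical, and the `sinc` target is written out here only as an example of a one-dimensional target function of the form $\sin(10x)/(10x)$.

```python
import math
import random

def make_regression_data(f, d, n_samples, sigma=5e-3, rng=random):
    """Sample inputs uniformly from [-1, 1]^d and add Gaussian noise to f(x)."""
    xs = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n_samples)]
    ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

def sinc(x):
    """sin(10 x) / (10 x) with the removable singularity at x = 0 filled in."""
    z = 10.0 * x[0]
    return 1.0 if z == 0.0 else math.sin(z) / z

rng = random.Random(1)
train_x, train_y = make_regression_data(sinc, d=1, n_samples=200, rng=rng)
test_x, test_y = make_regression_data(sinc, d=1, n_samples=200, rng=rng)
```

Passing a seeded `random.Random` instance keeps the generated training and test sets reproducible across runs.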

All data sets of regression problems contain 200 training samples and 200 test
samples. Tab. 1 shows the utilized functions and corresponding data sets for
the regression tests. The population size $N_p$ depends on the problem and is
manually adapted accordingly. For each problem, the global optimization is
applied in 50 independent runs and the mean required number of ANN evaluations
(MFE) and the corresponding standard deviation $\sigma_{MFE}$ to reach the error
threshold $e_0$ are determined. The population size is manually adapted such
that none of the 50 runs fails to reach the error threshold and the MFE is kept
minimal. We declare a global optimum as found when the error threshold is
reached. We define a robustness $\rho$ as

$$\rho = \frac{\text{number of successful runs}}{\text{total number of runs}}. \tag{22}$$
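The run statistics can be computed with a few lines of code. In this sketch the MFE and its standard deviation are taken over the successful runs only, and the sample standard deviation is used; both choices are assumptions, as is the function name `summarize_runs`.

```python
import statistics

def summarize_runs(evaluations, reached_threshold):
    """Return (MFE, standard deviation, robustness) for a set of independent runs."""
    successes = [e for e, ok in zip(evaluations, reached_threshold) if ok]
    rho = len(successes) / len(evaluations)  # Eq. (22)
    mfe = statistics.mean(successes) if successes else float("nan")
    sigma = statistics.stdev(successes) if len(successes) > 1 else float("nan")
    return mfe, sigma, rho

# Four hypothetical runs, one of which did not reach the error threshold
mfe, sigma_mfe, rho = summarize_runs([100, 200, 300, 400], [True, True, True, False])
```

For this toy input the robustness is 0.75 and the statistics are computed from the three successful runs.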

Results for the most complex settings (least number of hidden neurons) are
shown in Tab. 2. Fig. 4 shows results for the required number of ANN evaluations
(MFE) over the number of hidden neurons used in the ANN.

5.2 Classification problems 

The following classification problems are used: the iris data set [4][5], the
tic-tac-toe data set [1], balance [12][14] and a more challenging problem
defined by the two-spirals [15] data set. Samples are divided into a training
set and a test set following the format a/b. The selection process iteratively
and sequentially puts a samples into the training set and the following b
samples into the test set, until no samples are left. A winner-takes-all scheme
is applied to



Table 1: Target functions, corresponding data sets, error thresholds and typical
test set MSEs of the function regression experiments. There was no significant
variation of the MSE results with the choice of the learning method.

function                              data set   training / test samples   e_0        test set MSE
(x + 0.5)^2 (0.1 + (x + 0.65)^2)      syn5       200 / 200                 5×10^-5    6.27×10^-5 ± 3.2×10^-?
sin(10x)/(10x)                        sinc       200 / 200                 5×10^-5    5.99×10^-5 ± 7.6×10^-6
x/2 + sin(10x)/(10x)                  incsinc    200 / 200                 5×10^-5    5.88×10^-5 ± 7.1×10^-6
sin(5r)/(15r), r = sqrt(x^2 + y^2)    sinc2d     200 / 200                 5×10^-5    5.74×10^-5 ± 6.7×10^-6

Table 2: Results of Mean ANN (Function) Evaluations (MFE) from 50 independent
runs and population sizes of regression tests. Entries are given as
[population size, robustness] MFE. Better values are highlighted in boldface.
Note that on sinc (1-5-1) and incsinc (1-5-1), CMA-ES failed to find the global
optimum on some runs, even though a large population size of 10000 was used. On
sinc2d (2-3-1-3-1), CMA-ES failed in all 50 runs.

data set  topology   DE                               DE-SB                            CMA-ES                             MFE(DE) / MFE(DE-SB)
syn5      1-3-1      [?0, 1] 7.30×10^4 ± 1.8×10^4     [80, 1] 3.20×10^4 ± 9.4×10^?     [160, 1] 1.17×10^4 ± 2.6×10^?      2.3
sinc      1-5-1      [1400, 1] 9.22×10^7 ± 3.1×10^7   [160, 1] 4.47×10^5 ± 2.8×10^5    [10^4, 0.36] 3.78×10^? ± 3.7×10^5  206
sinc      1-6-1      [800, 1] 4.19×10^7 ± 2.2×10^7    [60, 1] 1.52×10^5 ± 4.5×10^4     [100, 1] 2.67×10^4 ± 5.3×10^3      276
incsinc   1-5-1      [1600, 1] 1.36×10^8 ± 6.0×10^7   [200, 1] 8.04×10^5 ± 1.8×10^5    [10^4, 0.8] 7.76×10^? ± 1.9×10^?   169
incsinc   1-6-1      [800, 1] 4.36×10^7 ± 2.6×10^7    [56, 1] 1.18×10^5 ± 5.0×10^4     [100, 1] 2.86×10^4 ± 8.2×10^3      370
sinc2d    2-3-1-3-1  [120, 1] 1.93×10^5 ± 4.2×10^4    [120, 1] 1.22×10^5 ± 2.4×10^4    [10^4, 0] NA                       1.58

Figure 4: Required mean function evaluations (MFEs) to find the global optimum
depending on the number of hidden neurons $N_2$. The corresponding network
topology is $(1 - N_2 - 1)$. [Panels: sinc, incsinc; horizontal axis: number of
hidden neurons.]
distinguish different classes. The output vector $y$ of a sample designating
class $i$ has the following format



As in the case of the function regression experiments, on each classification
problem, a predefined error threshold $e_0$ is used as a termination criterion
for the learning process. Tab. 3 shows the error thresholds and typical
classification success rates. Tab. 4 shows the results on the classification
data sets. Comparing DE



Table 3: Data sets, number of training and test samples, the sample selection
format, corresponding error thresholds and typical classification success rates
of classification experiments.

data set      training / test samples   format   e_0     test set classification success (%)
iris          75 / 75                   1/1      0.011   99.4 ± 0.6
tic-tac-toe   479 / 479                 1/1      0.08    92.1 ± 0.8
balance       312 / 313                 1/1      0.001   100 ± ?
two-spirals   97 / 97                   2/2      0.07    91.2 ± 1.4

Table 4: Results and population sizes of classification tests. Entries are given
as [population size, robustness] MFE. Better values are highlighted in boldface.
Note that on balance, CMA-ES failed to find the global optimum on some runs,
even though a large population size of 10000 was used. On two-spirals
(2-10-1-10-2) and two-spirals (2-9-1-9-2), CMA-ES failed in all 50 runs.

data set      topology      DE                               DE-SB                            CMA-ES                            MFE(DE) / MFE(DE-SB)
iris          4-3-3         [40, 1] 1.46×10^4 ± 3.3×10^?     [40, 1] 1.24×10^4 ± 2.5×10^?     [20, 1] 3.74×10^? ± 4.2×10^?      1.18
tic-tac-toe   9-8-2         [230, 1] 2.70×10^6 ± 4.7×10^5    [160, 1] 9.74×10^5 ± 2.2×10^4    [800, 1] 3.29×10^5 ± 3.9×10^4     2.78
balance       4-5-1-5-3     [520, 1] 1.38×10^7 ± 9.2×10^?    [160, 1] 8.06×10^5 ± 3.2×10^5    ?                                 17.1
two-spirals   2-10-1-10-2   [200, 1] 2.24×10^7 ± 4.8×10^?    [120, 1] 6.74×10^6 ± 2.5×10^6    [10^4, 0] NA                      3.32
two-spirals   2-9-1-9-2     [240, 1] 4.26×10^7 ± 1.5×10^7    [120, 1] 8.40×10^6 ± 2.6×10^6    [10^4, 0] NA                      5.07



and DE-SB, symmetry breaking consistently improves global search efficiency in
all experiments. On reducible networks with a larger number of hidden neurons,
as in sinc and incsinc, CMA-ES is superior and capable of robustly finding a
solution. However, on networks with the smallest number of hidden neurons,
representing the best compression, DE-SB outperforms CMA-ES. Furthermore, on
complex problems with deeper network topologies, CMA-ES seems to have
difficulties in robustly finding the global optimum, even with a very large
population size. On these types of problems, the true global search character
of DE pays off.
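The a/b sample selection and the winner-takes-all decision used for the classification problems of this section can be sketched as follows. The sketch is a literal reading of the description; how leftover samples are treated when the sample count is not a multiple of a + b is not specified in the text, so the tail handling here is an assumption, as are the function names.

```python
def split_samples(samples, a, b):
    """Alternately put a samples into the training set and b into the test set."""
    train, test = [], []
    i = 0
    while i < len(samples):
        train.extend(samples[i:i + a])
        test.extend(samples[i + a:i + a + b])
        i += a + b
    return train, test

def winner_takes_all(y):
    """Return the index of the largest output component as the predicted class."""
    return max(range(len(y)), key=lambda n: y[n])

train_1, test_1 = split_samples(list(range(10)), 1, 1)
train_2, test_2 = split_samples(list(range(7)), 2, 2)
predicted = winner_takes_all([-0.2, 0.9, 0.1])
```

With the 1/1 format the samples alternate between the two sets, while the 2/2 format moves them in pairs.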

6 Conclusions 

It is shown that symmetries in the ANN parameter space do affect the performance
of Differential Evolution (DE). From theoretical considerations, we derive an
ideal operator for breaking these symmetries. This ideal operator requires
knowledge about the global optimum of the parameter space. Since the global
optimum is not known a priori, the ideal operator is not directly applicable.
Another concern is that a brute force implementation of the ideal operator has
exponential complexity. Therefore, we propose a heuristic to approximate the
ideal operator, which has negligible overhead. Unlike the CMA-ES method, which
has at least quadratic complexity, the proposed DE with symmetry breaking
(DE-SB) has linear complexity and is generally applicable to very high
dimensional parameter spaces.

Experimental studies on a priori fixed topology networks indicate a significant
improvement over standard DE in terms of the required mean number of ANN
evaluations. Compared to CMA-ES, which is a state-of-the-art method, we
achieve superior results especially on complex problems and smaller networks,
which represent a better compression at comparable approximation quality. We
believe that other global optimization methods may also significantly benefit
from symmetry breaking. The specifics of how each method should realize a
symmetry breaking heuristic require further research.